Introduction

Row

Introduction

Data was extracted from the Starbucks Coffee Company Beverage Nutrition Information PDF. It contains nutritional information on all drinks on the Starbucks menu. More information on the dataset can be found here.

Purpose of this analysis report:

We will explore and analyze the Starbucks dataset to find out how healthy different drinks are, using variables such as the amount of sugar, trans fat and saturated fat. This analysis can provide insight into which drink choices affect consumer health for better or worse.

There are four main areas of analysis for this report:

  • Part 1: To obtain an overview of the Starbucks dataset and see which drinks have the most calories.

    • Research Question: Q1: Which drinks have the most calories?
  • Part 2: To compare the nutritional differences between coffee-based and tea-based drinks.

    • Research Question: Q2: Do coffee-based drinks contain more calories than tea-based drinks?
  • Part 3: To compare drinks containing different types of milk.

    • Research Question: Q3: How do drinks with different types of milk compare in terms of nutrients? Does the type of milk used in the drink affect the nutritional values?
  • Part 4: To apply model analysis and examine any possible linear relationship between variables.

    • Research Question: Q4: Which variables affect the amount of calories? What conclusions can be drawn?


Part 0: Data Cleaning

Row

Missing NAs in the variables

There are no missing values in the dataset.

Checking for duplicates

[1] 0

There are no duplicate rows in the dataset.

Checking for capitalization

 [1] "short"   "tall"    "grande"  "venti"   "trenta"  "solo"    "doppio" 
 [8] "triple"  "quad"    "1 scoop" "1 shot" 
 [1] "brewed coffee - dark roast"                     
 [2] "brewed coffee - decaf pike place roast"         
 [3] "brewed coffee - medium roast"                   
 [4] "brewed coffee - True North Blend Blonde roast"  
 [5] "Caffè Misto"                                    
 [6] "Clover Brewed Coffee - Dark Roast"              
 [7] "Clover Brewed Coffee -  Light Roast"            
 [8] "Clover Brewed Coffee - Medium Roast"            
 [9] "Iced Coffee"                                    
[10] "Iced Coffee with milk"                          
[11] "Cold Brewed Coffee"                             
[12] "Vanilla Sweet Cream Cold Brew"                  
[13] "Espresso - Caffè Americano"                     
[14] "Espresso - Iced Caffè Americano"                
[15] "Caffè Latte"                                    
[16] "Iced Caffè Latte"                               
[17] "Caffè Mocha"                                    
[18] "Iced Caffè Mocha"                               
[19] "Cappuccino"                                     
[20] "Caramel Macchiato"                              
[21] "Iced Caramel Macchiato"                         
[22] "Cinnamon Dolce Latte"                           
[23] "Espresso"                                       
[24] "Espresso con panna"                             
[25] "Espresso Macchiato"                             
[26] "Flat White"                                     
[27] "Latte Macchiato"                                
[28] "Skinny Cinnamon Dolce Latte"                    
[29] "Iced Skinny Cinnamon Dolce Latte"               
[30] "Skinny Mocha"                                   
[31] "Iced Skinny Mocha"                              
[32] "Starbucks Doubleshot on ice"                    
[33] "White Chocolate Mocha"                          
[34] "Iced White Chocolate Mocha"                     
[35] "Iced Black tea"                                 
[36] "Iced Black tea Lemonade"                        
[37] "Chai tea Latte"                                 
[38] "Iced Chai Tea Latte"                            
[39] "Earl Grey Brewed Tea"                           
[40] "Emperor's Clouds and Mist Brewed Tea"           
[41] "English Breakfast Black Brewed Tea"             
[42] "English Breakfast Black Tea Latte"              
[43] "Green Tea Latte"                                
[44] "Iced Green Tea Latte"                           
[45] "Iced Green Tea"                                 
[46] "Iced Green Tea Lemonade"                        
[47] "Jade Citrus Mint Brewed tea"                    
[48] "London Fog Tea Latte"                           
[49] "Iced Mango Black Tea"                           
[50] "Iced Mango Black Tea Lemonade"                  
[51] "Mint Majesty Brewed Tea"                        
[52] "Oprah Chai Herbal Brewed Tea"                   
[53] "Oprah Cinnamon Chai Brewed Tea"                 
[54] "Oprah Cinnamon Chai Latte"                      
[55] "Iced Oprah Cinnamon Chai Latte"                 
[56] "Passion Tango Brewed Tea"                       
[57] "Iced Passion Tango Tea"                         
[58] "Iced Passion Tango Tea Lemonade"                
[59] "Peach Iced Green Tea"                           
[60] "Peach Iced Green Tea Lemonade"                  
[61] "Peach Tranquility Brewed Tea"                   
[62] "Youthberry Brewed Tea"                          
[63] "Cool Lime Starbucks Refreshers"                 
[64] "Very Berry Hibiscus Starbucks Refreshers"       
[65] "Chocolate Smoothie"                             
[66] "Orange Mango Smoothie"                          
[67] "Strawberry Smoothie"                            
[68] "Caffè Vanilla Frappuccino Blended"              
[69] "Caramel Frappuccino Blended"                    
[70] "Coffee Frappuccino Blended"                     
[71] "Espresso Frappuccino Blended"                   
[72] "Java Chip Frappuccino Blended"                  
[73] "Caffè Vanilla Frappuccino Light"                
[74] "Caramel Frappuccino Light"                      
[75] "Coffee Frappuccino Light"                       
[76] "Espresso Frappuccino Light"                     
[77] "Java Chip Light Frappuccino"                    
[78] "Mocha Light Frappuccino"                        
[79] "Blended Strawberry Lemonade"                    
[80] "Chai Crème Frappuccino Blended"                 
[81] "Double Chocolaty Chip Crème Frappuccino Blended"
[82] "Green Tea Crème Frappuccino Blended"            
[83] "Oprah Cinnamon Chai Crème Frappuccino Blended"  
[84] "Strawberries & Crème Frappuccino Blended"       
[85] "Vanilla Bean Crème Frappuccino Blended"         
[86] "Caramel Apple Spice"                            
[87] "Hot Chocolate"                                  
[88] "Lemonade"                                       
[89] "Skinny Hot Chocolate"                           
[90] "White Hot Chocolate"                            
[91] "Protein & Fibre Powder"                         
[92] "Matcha Green Tea Powder"                        
[93] "Espresso shot"                                  

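Size labels are consistently lowercase. Product names mix capitalization styles (for example "brewed coffee - dark roast" and "Clover Brewed Coffee - Dark Roast"), so later text matching on product names ignores case.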

Checking for Data Types

spc_tbl_ [1,147 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ product_name   : chr [1:1147] "brewed coffee - dark roast" "brewed coffee - dark roast" "brewed coffee - dark roast" "brewed coffee - dark roast" ...
 $ size           : chr [1:1147] "short" "tall" "grande" "venti" ...
 $ milk           : num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ whip           : num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ serv_size_m_l  : num [1:1147] 236 354 473 591 236 354 473 591 236 354 ...
 $ calories       : num [1:1147] 3 4 5 5 3 4 5 5 3 4 ...
 $ total_fat_g    : num [1:1147] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
 $ saturated_fat_g: num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ trans_fat_g    : chr [1:1147] "0" "0" "0" "0" ...
 $ cholesterol_mg : num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ sodium_mg      : num [1:1147] 5 10 10 10 5 10 10 10 5 5 ...
 $ total_carbs_g  : num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ fiber_g        : chr [1:1147] "0" "0" "0" "0" ...
 $ sugar_g        : num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ caffeine_mg    : num [1:1147] 130 193 260 340 15 20 25 30 155 235 ...
 - attr(*, "spec")=
  .. cols(
  ..   product_name = col_character(),
  ..   size = col_character(),
  ..   milk = col_double(),
  ..   whip = col_double(),
  ..   serv_size_m_l = col_double(),
  ..   calories = col_double(),
  ..   total_fat_g = col_double(),
  ..   saturated_fat_g = col_double(),
  ..   trans_fat_g = col_character(),
  ..   cholesterol_mg = col_double(),
  ..   sodium_mg = col_double(),
  ..   total_carbs_g = col_double(),
  ..   fiber_g = col_character(),
  ..   sugar_g = col_double(),
  ..   caffeine_mg = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Variables milk and whip are numeric. We will convert them to factor and logical respectively. Variable size is character and will be converted to factor. fiber_g and trans_fat_g are stored as character and will be converted to numeric.

Part 1: Data Exploration

Row

By invoking the head() function, we can have a sneak peek at the data.

#sneak peek of data
head(starbucks)
# A tibble: 6 × 15
  product_name              size  milk  whip  serv_size_m_l calories total_fat_g
  <chr>                     <fct> <fct> <lgl>         <dbl>    <dbl>       <dbl>
1 brewed coffee - dark roa… short none  FALSE           236        3         0.1
2 brewed coffee - dark roa… tall  none  FALSE           354        4         0.1
3 brewed coffee - dark roa… gran… none  FALSE           473        5         0.1
4 brewed coffee - dark roa… venti none  FALSE           591        5         0.1
5 brewed coffee - decaf pi… short none  FALSE           236        3         0.1
6 brewed coffee - decaf pi… tall  none  FALSE           354        4         0.1
# ℹ 8 more variables: saturated_fat_g <dbl>, trans_fat_g <dbl>,
#   cholesterol_mg <dbl>, sodium_mg <dbl>, total_carbs_g <dbl>, fiber_g <dbl>,
#   sugar_g <dbl>, caffeine_mg <dbl>
#there are 1147 rows and 15 columns
#show dimensions of the dataset
dim(starbucks)
[1] 1147   15
#get names of columns
names(starbucks)
 [1] "product_name"    "size"            "milk"            "whip"           
 [5] "serv_size_m_l"   "calories"        "total_fat_g"     "saturated_fat_g"
 [9] "trans_fat_g"     "cholesterol_mg"  "sodium_mg"       "total_carbs_g"  
[13] "fiber_g"         "sugar_g"         "caffeine_mg"    

We can observe there are 1147 rows and 15 variables.

summary(starbucks)
 product_name            size          milk        whip         serv_size_m_l  
 Length:1147        grande :334   none   :165   Mode :logical   Min.   :  0.0  
 Class :character   venti  :320   nonfat :222   FALSE:864       1st Qu.:354.0  
 Mode  :character   tall   :318   2%     :190   TRUE :283       Median :473.0  
                    short  :123   soy    :190                   Mean   :461.3  
                    trenta : 21   coconut:190                   3rd Qu.:591.0  
                    doppio :  7   whole  :190                   Max.   :887.0  
                    (Other): 24                                                
    calories      total_fat_g     saturated_fat_g   trans_fat_g    
 Min.   :  0.0   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000  
 1st Qu.:130.0   1st Qu.: 1.000   1st Qu.: 0.200   1st Qu.:0.0000  
 Median :220.0   Median : 4.500   Median : 2.500   Median :0.0000  
 Mean   :228.4   Mean   : 6.186   Mean   : 3.881   Mean   :0.1212  
 3rd Qu.:320.0   3rd Qu.:10.000   3rd Qu.: 7.000   3rd Qu.:0.2000  
 Max.   :640.0   Max.   :28.000   Max.   :20.000   Max.   :2.0000  
                                                                   
 cholesterol_mg    sodium_mg     total_carbs_g      fiber_g      
 Min.   : 0.00   Min.   :  0.0   Min.   : 0.00   Min.   :0.0000  
 1st Qu.: 0.00   1st Qu.: 70.0   1st Qu.:20.00   1st Qu.:0.0000  
 Median : 5.00   Median :135.0   Median :37.00   Median :0.0000  
 Mean   :15.24   Mean   :139.7   Mean   :37.72   Mean   :0.8657  
 3rd Qu.:30.00   3rd Qu.:200.0   3rd Qu.:53.00   3rd Qu.:1.0000  
 Max.   :75.00   Max.   :370.0   Max.   :96.00   Max.   :9.0000  
                                                                 
    sugar_g       caffeine_mg    
 Min.   : 0.00   Min.   :  0.00  
 1st Qu.:18.00   1st Qu.: 30.00  
 Median :34.00   Median : 75.00  
 Mean   :34.99   Mean   : 91.86  
 3rd Qu.:49.00   3rd Qu.:150.00  
 Max.   :89.00   Max.   :475.00  
                                 

For the variable cholesterol_mg, the mean (15.24) is about three times the median (5.00), which suggests a right-skewed distribution.

starbucks %>%
  arrange(desc(calories))
# A tibble: 1,147 × 15
   product_name             size  milk  whip  serv_size_m_l calories total_fat_g
   <chr>                    <fct> <fct> <lgl>         <dbl>    <dbl>       <dbl>
 1 White Hot Chocolate      venti whole TRUE            591      640          28
 2 Iced White Chocolate Mo… venti whole TRUE            709      630          27
 3 White Chocolate Mocha    venti whole TRUE            591      620          27
 4 Iced White Chocolate Mo… venti 2%    TRUE            709      600          24
 5 Java Chip Frappuccino B… venti whole TRUE            709      600          22
 6 White Hot Chocolate      venti 2%    TRUE            591      590          23
 7 White Chocolate Mocha    venti 2%    TRUE            591      580          22
 8 Iced White Chocolate Mo… venti soy   TRUE            709      580          23
 9 Java Chip Frappuccino B… venti 2%    TRUE            709      580          20
10 Iced White Chocolate Mo… venti nonf… TRUE            709      570          20
# ℹ 1,137 more rows
# ℹ 8 more variables: saturated_fat_g <dbl>, trans_fat_g <dbl>,
#   cholesterol_mg <dbl>, sodium_mg <dbl>, total_carbs_g <dbl>, fiber_g <dbl>,
#   sugar_g <dbl>, caffeine_mg <dbl>

Here we can see a ranking of drinks with the most calories. White Hot Chocolate (venti, 591 ml, whole milk, with whipped cream) has the highest amount at 640 calories.

We can visualize the distribution of calories using a histogram.

Part 2: Coffee vs Tea

Row

We can observe in both histograms that coffee-based and tea-based drinks follow a similar distribution.

Now let’s compare drinks containing no milk. Drinks with coffee and no milk follow a left skewed distribution.

Part 3: Differences between drinks containing milk

Row

By creating boxplots of calories by milk type, we can observe that drinks made with whole milk have the most calories. However, drinks made with coconut milk have the most saturated fat.

Part 4: Analyzing variables and producing models.

Row

First we start by checking for correlation. Most numeric variables have a correlation of 0.5 or above with each other; the only exception is caffeine_mg, which shows low correlation with all other variables.

Since the assumption of linearity does not hold for our response variable with most predictors except caffeine_mg, we used linear regression to check whether the response variable calories is affected by this predictor.


Call:
lm(formula = calories ~ caffeine_mg, data = starbucks)

Residuals:
    Min      1Q  Median      3Q     Max 
-240.46 -100.75  -10.75   89.46  419.10 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 240.46421    6.26190   38.40   <2e-16 ***
caffeine_mg  -0.13141    0.05194   -2.53   0.0115 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 137.3 on 1145 degrees of freedom
Multiple R-squared:  0.005559,  Adjusted R-squared:  0.004691 
F-statistic: 6.401 on 1 and 1145 DF,  p-value: 0.01154

In the Residuals vs Fitted plot in the upper left corner, the trend line is not horizontal at zero, so the assumption of linearity does not hold. The Q-Q plot in the upper right corner shows the residuals following the straight dashed line, so the normality assumption holds. The Scale-Location plot in the lower left corner shows points scattered in almost vertical lines above and below the trend line across the plot. There are no influential points according to the Residuals vs Leverage plot.

Conclusion

Conclusion

Findings from the data analysis:

  1. Analysis shows that White Hot Chocolate (venti, whole milk, with whipped cream) is the drink with the most calories in the dataset, containing 640 calories. The drink with the lowest amount, at 0 calories, is the Earl Grey Brewed Tea.

  2. From the analysis, coffee-based and tea-based drinks on the Starbucks menu have similar amounts of calories. However, among drinks without milk, tea is the option with the lowest amount of calories.

  3. Drinks with whole milk have the highest amounts of calories, trans fat and total fat compared to drinks containing other types of milk. However, drinks containing coconut milk have a higher amount of saturated fat than drinks with other milks.

  4. Most variables are strongly correlated with each other, so multiple linear regression is not appropriate. However, caffeine_mg has low correlation with the other variables, so it was modeled against calories, but the diagnostic plots showed that not all assumptions held. Therefore, further modeling will be required for a more confident and precise conclusion.


References

Row

Reference

[1] Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

[2] Comprehensive R Archive Network (CRAN). (2021, May 14). CRAN - Package naniar. Naniar: Data Structures, Summaries, and Visualisations for Missing Data. https://cran.r-project.org/web/packages/naniar/index.html

[3] Auguie, B. (2017, September 9). CRAN - Package gridExtra. GridExtra: Miscellaneous Functions for “Grid” Graphics. https://cran.r-project.org/web/packages/gridExtra/index.html

[4] Comprehensive R Archive Network (CRAN). (2019, May 31). CRAN - Package ggResidpanel. GgResidpanel: Panels and Interactive Versions of Diagnostic Plots using ‘ggplot2’. https://cran.r-project.org/web/packages/ggResidpanel/index.html

[5] Comprehensive R Archive Network (CRAN). (2021a, April 5). CRAN - Package broom. Broom: Convert Statistical Objects into Tidy Tibbles. https://cran.r-project.org/web/packages/broom/index.html

[6] Müller, K. (2020, December 13). CRAN - Package here. here: A Simpler Way to Find Your Files. https://cran.r-project.org/web/packages/here/index.html

[7] Comprehensive R Archive Network (CRAN). (2019a, March 13). CRAN - Package readxl. readxl: Read Excel Files. https://cran.r-project.org/web/packages/readxl/index.html

[8] Comprehensive R Archive Network (CRAN). (2020, October 5). CRAN - Package readr. readr: Read Rectangular Text Data. https://cran.r-project.org/web/packages/readr/index.html

[9] Hao Zhu (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra

[10] Xie, Y. (2021, April 22). CRAN - Package bookdown. bookdown: Authoring Books and Technical Documents with R Markdown. https://cran.r-project.org/web/packages/bookdown/index.html

[11] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

[12] Comprehensive R Archive Network (CRAN). (2021a, January 10). CRAN - Package plotly. Plotly: Create Interactive Web Graphics via “Plotly.Js.” https://cran.r-project.org/web/packages/plotly/index.html

[13] Tierney, N. (2019, February 15). CRAN - Package visdat. Visdat: Preliminary Visualisation of Data. https://cran.r-project.org/web/packages/visdat/index.html

[14] Comprehensive R Archive Network (CRAN). (2021e, May 5). CRAN - Package dplyr. Dplyr: A Grammar of Data Manipulation. https://cran.r-project.org/web/packages/dplyr/index.html

[15] R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

[16] https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-12-21/readme.md

[17] https://globalassets.starbucks.com/assets/94fbcc2ab1e24359850fa1870fc988bc.pdf

---
title: "Report on healthiness of Starbucks drinks using nutritional information"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: fill
    source_code: embed
---
```{css, echo=FALSE}
.section {
  padding-top: 100px !important; /* Adjust this value as needed */
  margin-top: -70px !important; /* Adjust this value if necessary */
}
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8) 
library(flexdashboard)
library(tidyverse)
library(visdat)
library(gridExtra)
library(ggcorrplot)
library(rpart)
library(plotly)

```

```{r read-data, cache = TRUE, echo = FALSE}
starbucks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv')

```

Introduction {data-icon="fa-address-book"}
===================================== 

Row {data-width=600 .scrollable-section}
-----
### **Introduction**

Data was extracted from the [Starbucks Coffee Company Beverage Nutrition Information PDF](https://globalassets.starbucks.com/assets/94fbcc2ab1e24359850fa1870fc988bc.pdf). It contains nutritional information on all drinks on the Starbucks menu. More information on the dataset can be found [here](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-12-21/readme.md).

**Purpose of this analysis report**:

We will explore and analyze the Starbucks dataset to find out how healthy different drinks are, using variables such as the amount of sugar, trans fat and saturated fat. This analysis can provide insight into which drink choices affect consumer health for better or worse.

There are four main areas of analysis for this report:

* Part 1: To obtain an overview of the Starbucks dataset and see which drinks have the most calories.

  - Research Question: Q1: Which drinks have the most calories?

* Part 2: To compare the nutritional differences between coffee-based and tea-based drinks.

  - Research Question: Q2: Do coffee-based drinks contain more calories than tea-based drinks?

* Part 3: To compare drinks containing different types of milk.

  - Research Question: Q3: How do drinks with different types of milk compare in terms of nutrients? Does the type of milk used in the drink affect the nutritional values?

* Part 4: To apply model analysis and examine any possible linear relationship between variables.

  - Research Question: Q4: Which variables affect the amount of calories? What conclusions can be drawn?

***

***

Part 0: Data Cleaning 
===================================== 

Row {.tabset data-height=550, .scrollable-section}
-----

### **Missing NAs in the variables**

```{r missing-value-visualization}
starbucks %>% 
  visdat::vis_miss(sort_miss = TRUE)
#there are no missing values

```

There are no missing values in the dataset.

### **Checking for duplicates**

```{r check-duplicates}
#check for duplicates
sum(starbucks %>%
  duplicated())
```
There are no duplicate rows in the dataset.
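
As a rough illustration of what this check counts, the sketch below runs `duplicated()` on a small made-up data frame (the `toy` rows are hypothetical, not taken from the dataset). `duplicated()` flags every repeat of an earlier row, so the sum is the number of rows that `unique()` would drop.

```r
# Hypothetical toy data: the third row repeats the first one exactly.
toy <- data.frame(
  product_name = c("Espresso", "Lemonade", "Espresso"),
  size         = c("solo", "tall", "solo"),
  calories     = c(5, 90, 5),
  stringsAsFactors = FALSE
)

# duplicated() marks a row TRUE when an identical row appeared earlier.
n_dupes <- sum(duplicated(toy))
n_dupes

# unique() keeps the first occurrence of each row.
toy_unique <- unique(toy)
nrow(toy_unique)
```

A result of 0, as in this report, means every row is distinct across all columns.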

### **Checking for capitalization**

```{r standardize-capitalization}
#check if text is consistent
unique(starbucks$size)
unique(starbucks$product_name)


```

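Size labels are consistently lowercase. Product names mix capitalization styles (for example "brewed coffee - dark roast" and "Clover Brewed Coffee - Dark Roast"), so later text matching on product names ignores case.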
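
One possible way to test text consistency more directly is to compare the number of distinct names before and after lowercasing: a difference would mean the same drink appears under two capitalizations. The sketch below uses hypothetical names (`names_toy` is an assumption for illustration only).

```r
# Hypothetical names: two spellings differ only by capitalization.
names_toy <- c("Iced Black tea", "Iced Black Tea", "Lemonade")

n_raw   <- length(unique(names_toy))           # distinct as written
n_lower <- length(unique(tolower(names_toy)))  # distinct ignoring case

# Names that collide once case is ignored
lowered  <- tolower(names_toy)
clashing <- names_toy[lowered %in% lowered[duplicated(lowered)]]
clashing
```

If `n_raw` and `n_lower` are equal, no two names differ by capitalization alone.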

### **Checking for Data Types**

```{r check-data-types}
#Check if data types are consistent
str(starbucks)
```

Variables `milk` and `whip` are numeric. We will convert them to factor and logical respectively. Variable `size` is character and will be converted to factor. `fiber_g` and `trans_fat_g` are stored as character and will be converted to numeric.

```{r change-data-types}
#data cleaning
# whip to binary
# size should be converted to factor
starbucks = starbucks %>%
  mutate(across(c(milk, 
                  size),
                factor)) %>%
  mutate(across(whip, as.logical)) %>%
  mutate(across(c(fiber_g, trans_fat_g), as.numeric))

```


```{r}
starbucks$milk = recode_factor(starbucks$milk, 
                               "0" = "none", 
                               "1" = "nonfat",
                               "2" = "2%",
                               "3" = "soy",
                               "4" = "coconut",
                               "5" = "whole")



```
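
Converting a character column with `as.numeric` turns any non-numeric string into `NA` (with a warning), so it can be worth counting the NAs the conversion introduces. The sketch below uses a made-up vector; the value "trace" is a hypothetical bad entry, not something observed in this dataset.

```r
# Hypothetical character column with one entry that is not a number.
fiber_chr <- c("0", "1", "trace", "2")

# suppressWarnings() hides the "NAs introduced by coercion" warning.
fiber_num <- suppressWarnings(as.numeric(fiber_chr))

# Entries that were present as text but became NA after conversion
introduced_na <- sum(is.na(fiber_num) & !is.na(fiber_chr))
introduced_na
fiber_chr[is.na(fiber_num)]
```

A count of zero would indicate that every string was a valid number.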

Part 1: Data Exploration 
===================================== 
Row {.tabset data-height=550, .scrollable-section}
-----

By invoking the `head()` function, we can have a sneak peek at the data.

```{r overview-starbucks-1, echo = TRUE}
#sneak peek of data
head(starbucks)
```


```{r overview-starbucks-2, echo = TRUE }
#there are 1147 rows and 15 columns
#show dimensions of the dataset
dim(starbucks)
#get names of columns
names(starbucks)

```
We can observe there are 1147 rows and 15 variables.

```{r overview-starbucks-3, echo = TRUE}
summary(starbucks)
```
For the variable `cholesterol_mg`, the mean (15.24) is about three times the median (5.00), which suggests a right-skewed distribution.
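
To make the mean versus median comparison concrete, here is a minimal sketch on a hypothetical vector (`chol_toy` is made up, not drawn from the dataset). A few large values pull the mean upward while the median stays put.

```r
# Hypothetical cholesterol values (mg): mostly low, with a few large ones.
chol_toy <- c(0, 0, 0, 5, 5, 10, 30, 75)

m   <- mean(chol_toy)    # pulled upward by the large values
med <- median(chol_toy)  # robust to them

c(mean = m, median = med, gap = m - med)
```

A mean well above the median is a common sign of a long right tail.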

```{r overview-starbucks-4, echo = TRUE}
starbucks %>%
  arrange(desc(calories))

```
Here we can see a ranking of drinks with the most calories. White Hot Chocolate (venti, 591 ml, whole milk, with whipped cream) has the highest amount at 640 calories.
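
The ranking step can also be sketched in base R on a small hypothetical table (the drink names and values below are made up; only the column names mirror the dataset). Sorting in decreasing order and taking the first rows gives a top-n list.

```r
# Hypothetical rows; only the column names mirror the dataset.
toy <- data.frame(
  product_name = c("Drink A", "Drink B", "Drink C", "Drink D"),
  calories     = c(120, 640, 5, 300),
  stringsAsFactors = FALSE
)

# Sort by calories, highest first (base R counterpart of arrange(desc())).
ranked <- toy[order(toy$calories, decreasing = TRUE), ]

# Keep the two highest-calorie rows.
top2 <- head(ranked, 2)
top2$product_name
```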

We can visualize the distribution of calories using a histogram.

```{r}
ggplot(starbucks, aes(x=calories)) +
  geom_histogram(aes(y = after_stat(density)), colour="white", fill="#00704A") + 
  geom_density()

```

Part 2: Coffee vs Tea
===================================== 
Row {.tabset data-height=500, .scrollable-section}
-----

```{r clean-coffee}
tea_based_drinks = starbucks %>% 
  filter(grepl("tea", product_name, ignore.case = TRUE)) %>%
  mutate(has_coffee = FALSE)
coffee_based_drinks = starbucks %>%
  # match coffee drinks by keyword; "Caf" also covers "Caffè Latte"
  filter(grepl("Cof|Caf|Espresso|Mocha|Cappuccino|Macchiato", product_name, ignore.case = TRUE)) %>%
  mutate(has_coffee = TRUE)

drinks = rbind(tea_based_drinks, coffee_based_drinks)


```
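
The keyword matching can be sketched on a handful of names. `grepl()` with `ignore.case = TRUE` matches regardless of capitalization, and a misspelled alternative in the pattern silently matches nothing. The names and patterns below are illustrative assumptions, not the exact ones used in this report.

```r
# A few product names with mixed capitalization.
drinks_toy <- c("brewed coffee - dark roast", "Iced Black tea",
                "Caramel Macchiato", "Lemonade")

is_tea    <- grepl("tea", drinks_toy, ignore.case = TRUE)
is_coffee <- grepl("coffee|macchiato", drinks_toy, ignore.case = TRUE)

# A misspelled alternative ("Macchiatto") finds no match at all.
is_typo <- grepl("Macchiatto", drinks_toy, ignore.case = TRUE)

data.frame(drinks_toy, is_tea, is_coffee, is_typo)
```

Printing the flags next to the names makes it easy to spot drinks that fall into neither group, or into both.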

We can observe in both histograms that coffee-based and tea-based drinks follow a similar distribution.

```{r}
p1 = ggplot(coffee_based_drinks, aes(x=calories)) +
  ggtitle("Histogram of Calories for drinks containing coffee") +
  geom_histogram(aes(y = after_stat(density)), colour="white", fill="#6F4E37") + 
  geom_density()

p2 = ggplot(tea_based_drinks, aes(x=calories)) +
  ggtitle("Histogram of Calories for drinks containing tea") +
  geom_histogram(aes(y = after_stat(density)), colour="white", fill="orange") + 
  geom_density()




grid.arrange(p1,p2, nrow = 2)


```


Now let's compare drinks containing no milk. Drinks with coffee and no milk follow a left skewed distribution.


```{r}

drinks %>% 
  filter(milk == "none") %>%
  ggplot(aes(x=calories, color = has_coffee)) +
    ggtitle("Histogram of calories for coffee and tea drinks containing no milk") +
    geom_histogram(aes(y = after_stat(density)), colour="white", fill="brown") + 
    geom_density()



```

Part 3: Differences between drinks containing milk
===================================== 

Row {.tabset data-height=550 data-width=1000, .scrollable-section}
-----

By creating boxplots of calories by milk type, we can observe that drinks made with whole milk have the most calories. However, drinks made with coconut milk have the most saturated fat.
```{r}
p1 = starbucks %>%
  ggplot(aes(x=milk, y=calories, color = milk)) + 
    geom_boxplot() + 
    ggtitle("Different boxplots for calories per type of milk")
p2 = starbucks %>%
  ggplot(aes(x=milk, y=saturated_fat_g, color = milk)) + 
    geom_boxplot() + 
    ggtitle("Different boxplots for saturated fat(g) per type of milk")
p3 = starbucks %>%
  ggplot(aes(x=milk, y=trans_fat_g, color = milk)) + 
    geom_boxplot() + 
    ggtitle("Different boxplots for trans fat (g) per type of milk")
p4 = starbucks %>%
  ggplot(aes(x=milk, y=total_fat_g, color = milk)) + 
    geom_boxplot() + 
    ggtitle("Different boxplots for total fat (g) per type of milk")

grid.arrange(p1,p2,p3,p4, nrow = 2, ncol = 2)

```
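
A numeric counterpart to the boxplots is the median per group, which is the centre line of each box. The sketch below is a minimal illustration on hypothetical values (the `toy` data frame is made up; only the column names mirror the dataset).

```r
# Hypothetical calories for three milk types.
toy <- data.frame(
  milk     = c("nonfat", "nonfat", "whole", "whole", "soy", "soy"),
  calories = c(100, 140, 180, 220, 130, 150),
  stringsAsFactors = FALSE
)

# Median calories per milk type: the centre line of each box.
med_by_milk <- tapply(toy$calories, toy$milk, median)
med_by_milk

# Milk type with the highest median
top_milk <- names(which.max(med_by_milk))
top_milk
```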

Part 4: Analyzing variables and producing models.
===================================== 

Row {.tabset data-height=550 data-width=1000, .scrollable-section}
-----

First we start by checking for correlation. Most numeric variables have a correlation of 0.5 or above with each other; the only exception is `caffeine_mg`, which shows low correlation with all other variables.


```{r}
ggplotly(model.matrix(~0+., data=starbucks %>% select(where(is.numeric))) %>% 
  cor(use="pairwise.complete.obs") %>% 
  ggcorrplot(show.diag = F, type="lower", lab=TRUE))
```
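
Reading strong pairs off a correlation plot can be complemented by listing them programmatically. The sketch below is one way to do this in base R, using hypothetical values (the numbers in `toy` are made up; the 0.5 cutoff follows the threshold mentioned in the text).

```r
# Hypothetical numeric columns: sugar and carbs move together, caffeine does not.
toy <- data.frame(
  sugar_g       = c(0, 10, 20, 30, 40, 50),
  total_carbs_g = c(1, 12, 21, 33, 41, 52),
  caffeine_mg   = c(150, 0, 75, 150, 0, 75)
)

cm <- cor(toy)

# List each pair once (upper triangle) where |r| >= 0.5.
idx <- which(abs(cm) >= 0.5 & upper.tri(cm), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
strong_pairs
```

In this toy example only the sugar and carbohydrate columns exceed the cutoff.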

Since the assumption of linearity does not hold for our response variable with most predictors except `caffeine_mg`, we used linear regression to check whether the response variable `calories` is affected by this predictor.


```{r}
starbucks_lm <- lm(calories~caffeine_mg, data=starbucks)
summary(starbucks_lm)

```
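
The fitted coefficients can be read as a prediction equation. The sketch below plugs the estimates reported by `summary(starbucks_lm)` in this report (intercept 240.46421, slope -0.13141) into that equation by hand; the caffeine values are arbitrary example inputs.

```r
# Estimates copied from the regression summary in this report.
intercept <- 240.46421
slope     <- -0.13141   # change in calories per extra mg of caffeine

# Arbitrary example inputs (mg of caffeine).
caffeine  <- c(0, 100, 150)
predicted <- intercept + slope * caffeine
round(predicted, 2)

# Implied difference in calories for 100 mg more caffeine
diff_100 <- slope * 100
diff_100
```

The slope is statistically significant (p = 0.0115), but the reported R-squared of 0.005559 means caffeine explains less than 1% of the variation in calories.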
```{r}
par(mfrow=c(2,2))
plot(starbucks_lm)
```

In the Residuals vs Fitted plot in the upper left corner, the trend line is not horizontal at zero, so the assumption of linearity does not hold. The Q-Q plot in the upper right corner shows the residuals following the straight dashed line, so the normality assumption holds. The Scale-Location plot in the lower left corner shows points scattered in almost vertical lines above and below the trend line across the plot. There are no influential points according to the Residuals vs Leverage plot.

Conclusion
===================================== 


**Conclusion**

Findings from the data analysis:

1. Analysis shows that White Hot Chocolate (venti, whole milk, with whipped cream) is the drink with the most calories in the dataset, containing 640 calories. The drink with the lowest amount, at 0 calories, is the Earl Grey Brewed Tea.

2. From the analysis, coffee-based and tea-based drinks on the Starbucks menu have similar amounts of calories. However, among drinks without milk, tea is the option with the lowest amount of calories.

3. Drinks with whole milk have the highest amounts of calories, trans fat and total fat compared to drinks containing other types of milk. However, drinks containing coconut milk have a higher amount of saturated fat than drinks with other milks.

4. Most variables are strongly correlated with each other, so multiple linear regression is not appropriate. However, `caffeine_mg` has low correlation with the other variables, so it was modeled against `calories`, but the diagnostic plots showed that not all assumptions held. Therefore, further modeling will be required for a more confident and precise conclusion.

***



References
===================================== 

Row {.tabset data-height=550, .scrollable-section}
-----


**Reference**

[1] Wickham et al., (2019). Welcome to the tidyverse. Journal of Open
  Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

[2] Comprehensive R Archive Network (CRAN). (2021, May 14). CRAN - Package naniar. Naniar: Data Structures, Summaries, and Visualisations for Missing Data. https://cran.r-project.org/web/packages/naniar/index.html

[3] Auguie, B. (2017, September 9). CRAN - Package gridExtra. GridExtra: Miscellaneous Functions for "Grid" Graphics. https://cran.r-project.org/web/packages/gridExtra/index.html

[4] Comprehensive R Archive Network (CRAN). (2019, May 31). CRAN - Package ggResidpanel. GgResidpanel: Panels and Interactive Versions of Diagnostic Plots using 'ggplot2'. https://cran.r-project.org/web/packages/ggResidpanel/index.html

[5] Comprehensive R Archive Network (CRAN). (2021a, April 5). CRAN - Package broom. Broom: Convert Statistical Objects into Tidy Tibbles. https://cran.r-project.org/web/packages/broom/index.html

[6] Müller, K. (2020, December 13). CRAN - Package here. here: A Simpler Way to Find Your Files. https://cran.r-project.org/web/packages/here/index.html

[7] Comprehensive R Archive Network (CRAN). (2019a, March 13). CRAN - Package readxl. readxl: Read Excel Files. https://cran.r-project.org/web/packages/readxl/index.html

[8] Comprehensive R Archive Network (CRAN). (2020, October 5). CRAN - Package readr. readr: Read Rectangular Text Data. https://cran.r-project.org/web/packages/readr/index.html

[9] Hao Zhu (2021). kableExtra: Construct Complex Table with 'kable'
  and Pipe Syntax. R package version 1.3.4.
  https://CRAN.R-project.org/package=kableExtra

[10] Xie, Y. (2021, April 22). CRAN - Package bookdown. bookdown: Authoring Books and Technical Documents with R Markdown. https://cran.r-project.org/web/packages/bookdown/index.html

[11] H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
  Springer-Verlag New York, 2016.

[12] Comprehensive R Archive Network (CRAN). (2021a, January 10). CRAN - Package plotly. Plotly: Create Interactive Web Graphics via “Plotly.Js.” https://cran.r-project.org/web/packages/plotly/index.html

[13] Tierney, N. (2019, February 15). CRAN - Package visdat. Visdat: Preliminary Visualisation of Data. https://cran.r-project.org/web/packages/visdat/index.html

[14] Comprehensive R Archive Network (CRAN). (2021e, May 5). CRAN - Package dplyr. Dplyr: A Grammar of Data Manipulation. https://cran.r-project.org/web/packages/dplyr/index.html

[15] R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL
https://www.R-project.org/.

[16] https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-12-21/readme.md

[17] https://globalassets.starbucks.com/assets/94fbcc2ab1e24359850fa1870fc988bc.pdf